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Abstract 

Canonical correlation analysis (CCA) describes the associations between two sets of vari¬ 
ables by maximizing the correlation between linear combinations of the variables in each data 
set. However, in high-dimensional settings where the number of variables exceeds the sample 
size or when the variables are highly correlated, traditional CCA is no longer appropriate. 
This paper proposes a method for sparse CCA. Sparse estimation produces linear combina¬ 
tions of only a subset of variables from each data set, thereby increasing the interpretability 
of the canonical variates. We consider the CCA problem from a predictive point of view 
and recast it into a regression framework. By combining an alternating regression approach 
together with a lasso penalty, we induce sparsity in the canonical vectors. We compare the 
performance with other sparse CCA techniques in different simulation settings and illustrate 
its usefulness on a genomic data set. 
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1 Introduction 


The aim of canonical correlation analysis (CCA), introdnced by Hotelling (1936), is to identify 
and qnantify linear relations between two sets of variables. CCA is nsed in varions research helds 


to stndy associations, for example, in physical data (Pison and Van Aelst, 2004), biomedical data 


(Knstra, 2006), or environmental data (laci et al., 2010). One searches for the linear combinations 
of each of the two sets of variables having maximal correlation. These linear combinations are called 
the canonical variates and the correlations between the canonical variates are called the canonical 


correlations. We refer to e.g. Johnson and Wichern (1998, Chapter 10) for more information on 
canonical correlation analysis. 

At the same time, we want to indnce sparsity in the canonical vectors snch that the linear 
combinations only inclnde a subset of the variables. Sparsity is especially helpfnl in analyzing 
associations between high-dimensional data sets, which are commonplace today in, for example. 


genetics (Qi et ah, 2014) and machine learning (Snn et ah, 2011 Lin et ah, 2014). Therefore, 
we propose a sparse version of CCA where some elements of the canonical vectors are estimated 
as exactly zero, which eases interpretation. For this aim, we nse the formnlation of CCA as a 
prediction problem. 

Consider two random vectors x G and y G M'^. We assnme, withont loss of generality, that 
all variables are mean centered and that p < q. Denote the joint covariance matrix of (x,y) by 


E = 


y y 

■^xx '^xy 

V V 

^yx ^yy 


( 1 ) 


with r = rank[Hy^y) < p. Let A G and B G be the matrices with in their colnmns 
the canonical vectors. The new variables u = A'^x and v = B'^y are the canonical variates 
and the correlations between each pair of canonical variates give the canonical correlations. The 
canonical vectors contained in the matrices A and B are respectively given by the eigenvectors of 
the matrices 

^xx^xySyySyx and Syy SyxSx-xSxy. (2) 

Both matrices have the same positive eigenvalnes, the canonical correlations are given by the 
positive sqnare root of those eigenvalnes. 

The canonical vectors and correlations are typically estimated by taking the sample versions of 
the covariances in ([^ and compnting the corresponding eigenvectors and eigenvalnes. However, to 
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implement this procedure, we need to invert the matrices Sxx and Syy. When the original variables 
are highly correlated or when the number of variables becomes large compared to the sample size, 
the estimation imprecision will be large. Moreover, when the largest number of variables in both 


data sets exceeds the sample size (i.e. q > n), traditional CCA cannot be performed. Vinod (1976) 


proposed the canonical ridge, which is an adaptation of the ridge regression concept of Hoerl and 


Kennard (1970) to the framework of CCA, to solve this problem. The canonical ridge replaces the 


matrices and Syy by respectively (Sxx + ^il) and (Syy + ^ 2 !) . By adding the penalty 

terms ki and ^2 to the diagonal elements of the sample covariance matrices, one obtains more 
reliable and stable estimates when the data are nearly or exactly collinear. 


Another approach is to use sparse CCA techniques. Parkhomenko et ah (2009) consider a sparse 
singular value decomposition to derive sparse singular vectors. A limitation of their approach is 
that sparsity in the canonical vectors is only guaranteed if Sxx and Syy are replaced by their 


corresponding diagonal matrices. A similar approach was taken by Witten et ah (2009) who 
apply a penalized matrix decomposition to the cross-product matrix S^.^, but assume that one 


can replace the matrices Sxx and Syy by identity matrices. Waaijenborg et ah (2008) consider 


Wold’s (1968) alternating least squares approach to CCA and obtain sparse canonical vectors using 
penalized regression with elastic net. The ridge parameter of the elastic net is set to be large, 
thereby, according to the authors, ignoring the dependency structure within each set of variables. 


Waaijenborg et ah (2008), Witten et al.| (2009), and Parkhomenko et ah (2009) all impose 


covariance restrictions, i.e. Sxx = Syy = I for Waaijenborg et ah (2008) and Witten et ah (2009); 


diagonal matrices for Parkhomenko et al. (2009). In contrast, we propose in this paper to estimate 
the canonical variates without imposing any prior covariance restrictions. Our proposed method 
obtains the canonical vectors using a alternating penalized regression framework. By performing 
variable selection in a penalized regression framework using the lasso penalty ( Tibshirani| 1996| , 
we obtain sparse canonical vectors. 

We demonstrate in a simulation study that our Sparse Alternating Regression (SAR) algorithm 
produces good results in terms of estimation accuracy of the canonical vectors, and detection of 
the sparseness structure of the canonical vectors. Especially in simulation settings when there is 
a dependency structure within each set of variables, the SAR algorithm clearly outperforms the 
sparse CCA methods described above. We also apply the SAR algorithm on a high-dimensional 
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genomic data set. Sparse estimation is appealing since it highlights the most important variables 
for the association study. 

The remainder of this article is organized as follows. In Section 2 we formulate the CCA 
problem from a predictive point of view. Section 3 describes the Sparse Alternating Regression 
(SAR) approach and provides details on the implementation of the algorithm. Section 4 compares 
our methodology to other sparse CCA techniques by means of a simulation study. Section 5 
discusses the genomic data example, Section 6 concludes. 


2 CCA from a predictive point of view 

A characterization of the canonical vectors based on the concept of prediction is proposed by 


Brillinger 

(1975 

) and |Izenman 

(1975) 


( 3 ) 


consider the optimization problem 

n 

(A, B) = argmin 11 A'^Xi — B^yi| |^. 

(A,B)e5 

We restrict the parameter space to the space S, given by 

5 = {(A,B) : A e e rank(A) = rank(B) = r, A'^E^xA = B^SyyB = R}. 

We impose normalization conditions requiring the canonical variates to have unit variance and to 
be uncorrelated. Brillinger (1975) proves that the objective function in ([^ is minimized when A 
and B contain in their columns the canonical vectors. 

We build on this equivalent formulation of the CCA problem to obtain the canonical vec¬ 


tors using an alternating regression procedure (see e.g. Wold, 1968 Branco et ah, 2005). The 
subsequent canonical variates are sequentially derived. 


First canonical vector pair. Denote the hrst canonical vectors (i.e. the hrst columns of the 
matrices A and B) by (Ai, Bi). Suppose we have an initial value A^ for the hrst canonical vector 
in the matrix A. Then the minimization problem in ([^ reduces to 

n 

BiIAi = argmin^ (Ai^Xi - Bi^yi)^ (4) 

i=i 
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where we require vi = YBi to have unit variance. The solution to Q can be obtained from 
a multiple regression with XA^ as response and Y as predictor, where X = [xi,... ,Xn]^ and 

Y = 

Analogously, for a hxed value B^. The optimal value for Ai is obtained by a multiple regression 
with YB^ as response and X as predictor 

n 

Ai|B* = argmin V (B*^yi - A/xi)^ (5) 

i=i 

where we require ui = XAi to have unit variance. This leads to an alternating regression scheme, 
where we alternately update our estimates of the hrst canonical vectors until convergence. We 
iterate until the angle between the estimated canonical vectors in iteration i and the respective 
estimated canonical vectors in the previous iteration are both smaller than some value e (e.g. 
e = 10-=^). 


Higher order canonical vector pairs. The higher order canonical variates need to be orthogonal 
to the previously found canonical variates. Therefore, the alternating regression scheme is applied 


on deflated data matrices (see e.g. Branco et ah, 2005). For the second pair of canonical vectors, 
consider the deflated matrices 

X* = X-ui(ufui)-^u[X. (6) 


The deflated matrix X* is obtained as the residuals of the multivariate regression of X on ui, the 
hrst canonical variate. Analogously, the debated matrix Y* is given by 


Y* 


Y-vi(v(vi) ^vfY, 


(7) 


the residuals of the multivariate regression of Y on vi. 

Using the Least Squares property, each column of X* is uncorrelated with the hrst canonical 
variate Ui. The second canonical variate will be a linear combination of the columns of X* and, 
hence, will be uncorrelated to the previously found canonical variate. The same holds for Y*. The 
second canonical variate pair is then obtained by alternating between the following regressions 
until convergence: 

n 

B; I a; = argmin ^ (A^^xt - B^^y ?) ^ (8) 
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n 


AfxO^ 


(9) 


A;|B; = argmin V - 

A * ' ^ 

^2 i=l 


where we require V 2 = Y*B 2 and U 2 = X*A 2 to have both unit variance. 

Finally, we need to express the second canonical vector pair in terms of the original data sets 
X and Y. To obtain the second canonical vector A 2 , we regress Ug ^ 


A 2 


argmin (u^ — A 2 ^Xi)" 

A2 


i=l 


( 10 ) 


yielding the htted values U 2 = XA 2 . To obtain B 2 , we regress V 2 on Y. 


n 

§2 = argmin V (v^ - B 2 ^yi)^ . 

j=i 

The same idea is applied to obtain the higher order canonical variate pairs. 


( 11 ) 


3 Sparse alternating regressions 

The canonical vectors obtained with the alternating regression scheme from Section 2 are in general 
not sparse. Sparse canonical vectors are obtained by replacing the Least Squares regressions in 
the alternating regression approach of Section 2 with Lasso regressions (Li-penalty). As such, 
some coefficients in the canonical vectors will be set to exactly zero, thereby producing linear 
combinations of only a subset of variables. 

For the hrst pair of sparse canonical vectors, the sparse equivalents of the Least Squares 
regressions in equations (|^ and ([^ are given by 

n q 

Bi|Ai = argmin^ (Ai^x; - Bi'^y;)^ + Abi \hji\, 
i=l j=l 

n p 

Ai|Bi = argmin (B^^yi - ai^Xi)^ + Aai ^ |aji|, 

i=i j=i 

where Abi > 0 and Aaj > 0 are sparsity parameters, bji is the (j = 1,... ,q) element of the 
hrst canonical vector Bi and a^i is the {j = 1,... ,p) element of the hrst canonical vector Ai. 
The hrst pair of canonical variates are given by ui = XAi and vi = YBi. We require both to 
have unit variance. 


6 


To obtain the second pair of sparse canonical vectors, the same deflated matrices as in equations 
and Q are used. The Least Squares regressions in equations (|^ and ([^ are replaced by the 
Lasso regressions 

n q 

B;|A; = argmin^ + Ab* Ib^al 

i=l j = l 


A 2 IB; = argrnin^ - A^^Xj)^ + Aa| |a* 2 l- 

^2 i=i j=i 

Finally, to express the second pair of canonical vectors in terms of the original data matrices, we 
replace the Least Squares regression in (10) and ( pT| by the two Lasso regressions. 

n p 

A 2 = argmin^ (u^ - A2'^Xi) + Aa2 ^ \^j2\, 

A.2 -1 • T 

1=1 J=1 


B 2 = argminV] (v^ - Ba^yi)^ + Ab 2 Y] |bj 2 |, 


2=1 


yielding the htted values U 2 = XA 2 and V 2 = YB 2 . We add a lasso penalty to the above 
regressions, first because the design matrix X can be high-dimensional, and second, because we 
want A 2 and B 2 to be sparse. 

A complete description of the algorithm is given below. We numerically verihed that without 
imposing penalization (i.e. Aaj = A^j = 0, for j = l,...,r), the traditional CCA solution 
is obtained. Our numerical experiments all converged reasonably fast. Finally, note that as in 


other sparse CCA proposals (Witten et ah, 2009; Parkhomenko et ah, 2009 Waaijenborg et al. 


2008) the sparse canonical variates are in general not uncorrelated. We do not consider this lack 


of uncorrelatedness as a major flaw. The sparse canonical vectors yield an easily interpretable 
basis of the space spanned by the canonical vectors. After suitable rotation of the corresponding 
canonical variates, this basis can be made orthogonal (but not sparse) if one desires so. 
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Sparse Alternating Regression (SAR) Algorithm 

Let X and Y be two data matrices. 

1. Preliminary steps 


• Xo = X - Ix^ 

. Yo = Y - ly^ 

2. Alternating Regressions: For I = 1,... ,r 

• If ? > 1 : Deflated matrices 

• = X/_2 — U;_i(u^^U/_i)“^U^^X/_2 

• Y;_i = Y/_2 - V/-l(v^lV/_i)“lV;^jYi_2 

• Starting values 


• using the canonical vector 6“”’’’ 

• V, - 

From iteration s = 1 until convergence 

• a["^ = argminX:r=i(v[7^^ - Ei=i 


using the canonical vector obtained with the canonical ridge 


j=i l^il 


*(«) ^ sr' 

' ||aW|| 

_ ’V' 


= argminX:”=i(u|y - + \b,i Y.]=i l^il 


, u(^) _ K’’ 

' - i|b(-)|| 

. vy)=Y,_iby) 

• After convergence, resulting in a*, b*, u* and v* 
• U/_i = [ui,.. .,ui-i] 


• Ul = 


nr - Uz_i (uf_iU,_i) 


• Ai = 


slI ii 1 = 1 

argminX;”=i(ui,i - + Xa,i Ej=i \^j\ if ^ > 1 


Ul = XoA; 

V;_i = [ui, . . . ,U;_i] 


• 0; = vf - V;_ 


Vf_ 


• Bi = 


b* if ; = 1 

argminX;”=i(v/.* - + As,/ Y.j=i if ^ > 1 

B 


• vi= YqB/ 


3. Final solution 


* -^sparse [-^1 7 • ■ • ? 

* Bgparse — [Bl, • • • , B^ 




Starting values. To start up the Sparse Alternating Regression (SAR) algorithm, an initial 
value is required. We use the canonical vectors delivered by the canonical ridge as starting value, 
which is available at no computational cost. The regularization parameters of the canonical ridge 
are chosen using 5-fold cross-validation such that the average test sample canonical correlation is 


maximized (Gonzalez et ah, 2008). 


We performed a simulation study (unreported) to assess the robustness of the SAR algorithm 
to different choices of starting values. The SAR algorithm shows similar performance when either 
the canonical ridge or other choices of starting values (i.e. CCA in low-dimensional settings and 
randomly drawn starting values) are used. 

Number of canonical variates to extract. For practical implementation, one needs to have an 
idea on the number of canonical variates r to extract. Most often, only a limited number of 


canonical variate pairs are truly relevant. We follow An et al. (2013) who propose the maximum 


eigenvalue ratio criterion to decide on the number of canonical variates to extract. We apply the 
canonical ridge and calculate the canonical correlations pi,..., prmax, with rmax = min{p, q). Let 
kj = pj/pj+i for j = 1,..., rmax — 1. Then we set r = aigmaXjkj, and extract r pairs of canonical 
variates using the SAR algorithm. 


Selection of sparsity parameters. In the SAR algorithm, the sparsity parameters Xaj and Xbj 
{ j = l,...,r), which control the penalization on the respective regression coefficient matrices, 
need to be selected. We select the sparsity parameters according to a minimal Bayes Information 
Criterion (BIC). We solve the corresponding penalized regression problems over a range of values 
and select for each the one with lowest value of 


. = -2\ogLx^ . + log(n), 

BlCxgj = -21ogLAfl,, + log(n), 

for j = 1,... ,r. Ta^j is the estimated likelihood using sparsity parameter Xaj and kx^^ is the 
number of non-zero estimated regression coefficients. Analogously for Xbj- 
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4 Simulation Study 


We compare the performance of the Sparse Alternating Regression approach with three other 
sparse CCA techniqnes. We consider 

• The Sparse Alternating Regression (SAR) algorithm detailed in Section 3. 


The sparse CCA of Witten et al. (2009 , relying on a penalized matrix decomposition applied 
to the cross-prodnct matrix Tlxy. Sparsity parameters are selected using the permutation 


approach described in Gross et al. (2011). 


The sparse CCA of Parkhomenko et al. (2009 ^ Sparsity parameters are selected using 


5-fold cross-validation where the average test sample canonical correlation is maximized. 


• The sparse CCA of Waaijenborg et al. (2008l|^ The lasso parameter of the elastic net is 
selected using 5-fold cross-validation such that the mean absolute difference between the 
canonical correlation of the training and test sets is minimized. 

We emphasize that the sparsity parameters of all methods are selected as proposed by the respec¬ 
tive authors. The traditional CCA solution and the canonical ridg^are computed as additional 
benchmarks. 

We consider several simulation schemes. For each setting we generate data matrices X and 
Y according to multivariate normal distributions, with covariance matrices described in Table 
The number of simulations for each setting is M = 1000. In all simulation settings, the canonical 


vectors have a sparse structure. In the first simulation setup (revised from Branco et ah, 2005) 
the covariance restrictions of Waaijenborg et al. (2008), Witten et al.[ (|2009) and Parkhomenko 


et al. (2009) (i.e. Sxx = Syy = I for the former two, diagonal matrices for the latter) are 


satisfied. These restrictions are violated in the second, third and fourth simulation setup. In the 
third design, the number of variables is large compared to the sample size. Traditional CCA can 
still be performed in this setting. In the fourth design, the number of variables in the data matrix 
Y is larger than the sample size, and traditional CCA can no longer be performed. 


^Available in the R package PMA ( 

Witten et al. 

2011 

1 . 



^Available at http://www.uhnres.utoronto.ca/labs/tritchler/ 


^We re-implemented the algorithm of 

Waaijenborg et al. 

(200 

81 in R. 

^Available in the R package CCA ( 

(Jonzalez and Dejean 

2009 

. 


10 


























































Performance measures. We compare the SAR algorithm to its alternatives and evaluate (i) the 
precision accuracy of the space spanned by the estimated canonical vectors, and (ii) the detection 
of the sparsity structure of the canonical vectors. 

We compute for each simulation run m, with m = 1,..., M, the angle A) between 

the subspace spanned by the estimated canonical vectors contained in the columns of A™ and the 
subspace spanned by the true canonical vectors contained in the columns of A. Analogously for 
the matrix B. The average angles are given by 

M 1 ^ 

«(A,A) = -5^«'”(A’",A) and e(B, B) = - B), 

m=l m=l 


Finally, we monitor the sparsity recognition performance (e.g. Rothman et ah, 2010) using the 
true positive rate and the true negative rate as defined as follows 


TPR{A,A) 

TNR{A,A) 


#{(bj) : Ajj A 0 and Ajj A 0} 
#{(bi) : Aij A 0} 
#{(bj) : Ajj = 0 and Ajj = 0} 

#{(z,j) : Aij = 0} 


The true positive rate indicates the number of true relevant variables detected by the estimation 
procedure. The true negative rate measures the hit rate of excluding unimportant variables from 
the canonical vectors. Analogue measures can be computed for the canonical vectors in the matrix 

B. 


Results. The simulation results on the estimation accuracy of the estimated canonical vectors 
are reported in Table We compute the average angle (averaged across simulation runs) between 
the space spanned by the true and estimated canonical vectors. To compare the average angle 
of the SAR algorithm against the other approaches, we compute p-values of a two-sided paired 
t-test. 

We hrst compare the performance of the penalized CCA techniques (i.e. canonical ridge and 
sparse CCA) to the unpenalized CCA solution. The estimation accuracy of the penalized CCA 
methods is significantly better compared to traditional CCA, especially in the high-dimensional 
design. In the lower dimensional simulation settings (i.e. uncorrelated and correlated design), 
sparse CCA techniques are still doing well since the underlying structure of the canonical vectors 
is sparse. 
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Table 2: Estimation accuracy of the canonical vectors, measured by the average angle between 
the subspace spanned by the true and estimated canonical vectors. P-values comparing SAR to 
alternatives are all < 0.01. 


Design 

Method 



e{A,A) 

0(B,B) 

Uncorrelated 

SAR 



0.008 

0.019 


Witten et al. 

(2009 


0.010 

0.054 


Parkhomenko et al. 

(200C 

) 0.104 

0.241 


Waaijenborg et al. 

2008) 

0.078 

0.212 


Canonical ridge 


0.129 

0.272 


CCA 



0.127 

0.267 

Correlated 

SAR 



0.001 

0.068 


Witten et al. 

(2009 


0.061 

0.299 


Parkhomenko et al. 

(200C 

) 0.306 

0.690 


Waaijenborg et al. 

2008) 

0.187 

0.494 


Canonical ridge 


0.046 

0.043 


CCA 



0.042 

0.033 

High-dimensional 

SAR 



0.212 

0.305 


Witten et al. 

(2009 


0.263 

0.390 


Parkhomenko et al. 

(200C 

) 0.826 

0.908 


Waaijenborg et al. 

2008) 

0.833 

0.942 


Canonical ridge 


0.916 

1.016 


CCA 



1.062 

1.193 

Overparametrized 

SAR 



0.278 

0.353 


Witten et al. 

(2009 


0.490 

0.592 


Parkhomenko et al. 

(200C 

) 0.946 

0.986 


Waaijenborg et al. 

2008) 

1.137 

1.181 


Canonical ridge 


0.916 

1.211 
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Next, we compare the SAR algorithm to its sparse alternatives. In the uncorrelated design, 


the covariance restrictions imposed by Waaijenborg et ah (2008), Parkhomenko et ah (2009) and 


Witten et al. (2009) are satished. Therefore, we expect these methods to perform especially 


well. Nevertheless, even in this setting, the SAR algorithm performs signihcantly better. In 
the correlated design, the high-dimensional and the overparametrized design these covariance 
restrictions are violated. Here, we see even more clearly that the SAR algorithm has a signihcant 
advantage over its sparse alternatives. In the correlated design, for instance, the SAR algorithm 


outperforms the method of Witten et al. (2009) by a factor 10 for the hrst canonical vector (i.e. 


estimation accuracy of 0.001 against 0.061), and by a factor 5 for the second canonical vector 
(i.e. estimation accuracy of 0.068 against 0.299). The gains in estimation accuracy of the SAR 
algorithm compared to the other sparse CCA methods are even more outspoken. 

Finally, Table compares the results on sparsity recognition performance among the sparse 


CCA techniques. The methods of Parkhomenko et al. (2009) and Waaijenborg et al. (2008) 


produce the least sparse solution, indicated by the high true positive rates and low true negative 


rates. The SAR algorithm and the method of Witten et al. (2009) tend to produce the most spare 


solutions, indicated by the high true negative rates and low true positive rates. Contrary to sparse 
CCA, traditional CCA and the canonical ridge don’t perform variable selection simultaneously 
with model estimation. Therefore, traditional CCA and canonical ridge are not included in Table 
1^ All elements of the canonical vectors are estimated as non-zero, resulting in a perfect true 
positive rate and zero true negative rate. 

To conclude, as we can see from Table in every simulation design we consider, the SAR 
algorithm did perform signihcantly better than the other sparse CCA methods. 


5 Genomic data application 

In recent years, high-dimensional genomic data sets have arisen, containing thousands of gene 


expression and other phenotype measurements (e.g.. 

Chen et al., 

2010 

Daye et al. 

2012 

). We 

use the publicly available breast cancer data set described in 

Chin et al. 

(2006) and available in 


the R package PMA (Witten et al., [2011 ). Comparative genomic hybridization (CGH) data (2149 
variables) and gene expression data (19 672 variables) are available on 89 samples. The objective is 
to identify copy number change variables that are correlated with a subset of gene expression vari- 
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Table 3: Sparsity recognition performance: trne positive rate and true negative rate for canonical 
vectors in the A and B matrices. 


Design 

Method 




TPR 

A 

TNR 

B 

TPR 

TNR 

Uncorrelated 

SAR 




0.79 

0.81 

0.79 

0.86 


Witten et al. 

(2009 



0.75 

0.83 

0.77 

0.78 


Parkhomenko et al. 

(200C 

) 

0.94 

0.22 

0.93 

0.25 


Waaijenborg et al. 

2008) 


0.91 

0.25 

0.91 

0.25 

Correlated 

SAR 




0.83 

0.92 

0.55 

0.93 


Witten et al. 

(2009 



0.52 

0.80 

0.43 

0.75 


Parkhomenko et al. 

(200C 

) 

0.86 

0.23 

0.83 

0.26 


Waaijenborg et al. 

2008) 


0.86 

0.31 

0.83 

0.31 

High-dimensional 

SAR 




0.54 

0.82 

0.51 

0.83 


Witten et al. 

(2009 



0.38 

0.87 

0.31 

0.86 


Parkhomenko et al. 

(200C 

) 

0.72 

0.35 

0.70 

0.40 


Waaijenborg et al. 

2008) 


0.87 

0.24 

0.82 

0.27 

Overparametrized 

SAR 




0.44 

0.89 

0.44 

0.89 


Witten et al. 

(2009 



0.37 

0.82 

0.32 

0.82 


Parkhomenko et al. 

(200C 

) 

0.61 

0.49 

0.56 

0.55 


Waaijenborg et al. 

2008) 


0.90 

0.16 

0.87 

0.18 
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ables. Copy number changes on a particular chromosome are associated with expression changes 


in genes located on the same chromosome (Witten et al., 2009). Therefore, we analyze the data 
for each chromosome separately, each time using the CGH and gene expression variables for that 
particular chromosome. The dimension of both sets of variables is large compared to the sample 
size such that traditional CCA cannot be performed. In such high-dimensional setting, the use of 
sparse CCA techniques is appealing. We use the SAR algorithm to perform sparse CCA for each 
chromosome separately. 

To decide on the number of canonical variates pairs to extract, we apply the canonical ridge 
to each chromosome. Figure shows the hrst 20 estimated canonical correlations for each of the 
23 chromosomes. For each chromosome, we use the maximum eigenvalue ratio criterion, discussed 
in Section 3, to determine the number of canonical variate pairs to extract. Depending on the 
specihc chromosome, this criterion indicates to extract either 1, 2, 3 or 4 canonical variate pairs. 

To compare the performance of the SAR algorithm to the other sparse CCA procedures dis¬ 
cussed in Section 4, we perform an out-of-sample cross-validation exercise. More precisely, we 
perform a leave-one-out cross-validation exercise and compute the cross-validation score 


n 


i=l 

where and contain the estimated canonical vectors when the observation is left out 
of the estimation sample. We compute this cross-validation score for each of the sparse CCA 
techniques. The technique that leads to the lowest value of this cross-validation score achieves the 
best out-of-sample performance. 

Averaged across all chromosomes, the SAR algorithm attains a cross-validation score of 87.21, 


the method of Witten et al. (2009) 367.38, Parkhomenko et al. (2009) 2778.57 and Waaijenborg 


et al. (2008) 713.57. Thus, the SAR algorithm outperforms its alternatives. Furthermore, we 


compute relative cross-validation scores, being the cross-validation score of a method relative 
to the cross-validation score of the SAR algorithm. Boxplots of these relative cross-validations 
scores (23 scores, one for each chromosome) are presented in Figure A value of the relative 
cross-validation score larger than 1 (horizontal red line) indicates better performance of the SAR 
algorithm. The SAR algorithm always attains the best cross-validation score, except for two cases 


(out of 23) where Witten et al. (2009) achieves the lowest cross-validation score. The differences 


in performance compared to the method of Parkhomenko et al. (2009) and Waaijenborg et al. 
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Figure 1: Estimated canonical correlations using the canonical ridge, for each of the 23 chro¬ 
mosomes. The highest order pair of canonical variates to retain, as selected by the maximum 
eigenvalue ratio criterion, is indicated by a solid black circle. 
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Relative cross-validation scores 



Figure 2: Boxplots of the 23 relative cross-validation scores of Witten et al. (2009), Parkhomenko 


et al.[ (2009) and Waaijenborg et ah, relative to the SAR algorithm. 


(2008) are large. The cross-validation scores obtained with the SAR algorithm and the method of 


Witten et al. (2009) are substantially lower than those obtained with the method of Parkhomenko 


et al. (2009) and Waaijenborg et al. (2008). The solutions obtained with the former two are 


much sparser than the once obtained with the later two. Sparsity thus helps in achieving a good 
cross-validation score. 

The dependency structure within each set of variables might explain the good performance of 
the SAR algorithm relative to its alternatives. For the hrst chromosome, for instance, 20% of the 
(absolute) correlations between the 136 CGH spots are larger than 0.6. The same holds for the 
other chromosomes. In the simulation study from Section 4, we show that the SAR algorithm 
performs much better for highly correlated data sets than the other sparse CCA techniques, that 
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impose prior covariance restrictions. This might explain why the SAR algorithm ontperforms its 
alternatives in the ont-of-sample cross-validation exercise. 

Next, we discuss the solution provided by the SAR algorithm. For each chromosome, sparse 
canonical vectors are obtained. We do not fix the number of non-zero elements in the canonical 
vectors in advance, but select the sparsity parameter using the BIC discussed in Section 3. Figure 
[^represents for each chromosome the copy number change measurements with non-zero weight^ 
Each CGH spot has a certain position on a chromosome, called the nucleotide position. The CGH 
measurements selected by the SAR algorithm are indicated by plotting a vertical line on their 
respective nucleotide position. The different colors indicate the subset of variables selected in the 
construction of the corresponding canonical variate pair (first pair: black, second: red, third: blue, 
fourth: green). 

We see from Figure that the degree of sparsity selected by the BIC varies from one chromo¬ 
some to the other. For chromosome 15, for example, only one canonical variate pair is selected 
and the BIC suggests a very sparse canonical vector. For chromosome 1, two canonical variate 
pairs are extracted with a large number of non-zero elements in the second canonical variate pair. 
However, a lot of non-zero weights are small in magnitude which can be seen from the length of 
the vertical lines. By adjusting the sparsity parameter to a higher value, a sparser solution could 
be obtained. A trade-off needs to be made between inducing more sparsity and thus performing 
better noise filtering, on the one hand, and reducing the risk of not including all important vari¬ 
ables, on the other hand. Depending on the researcher’s objective, the desired level of sparsity 
can be easily controlled by adjusting the sparsity parameter. 


6 Conclusion 


In high-dimensional settings, the estimation imprecision of traditional CCA will be large. To 
overcome this problem, penalized versions of CCA have been introduced such as the canonical 
ridge or sparse CCA. The canonical ridge still includes all variables in the canonical vectors, 
whereas sparse CCA only includes a subset of the variables. This is highly valuable in high¬ 
dimensional settings since it eases interpretation, as illustrated in the genomic data application. 


^The construction of this figure is similar to the one presented in Witten et al. (2009). We use the R-code 


available in the R package PMA (Witten et al. 2011). 
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Figure 3: SAR algorithm: copy number change measurements with non-zero weights in the hrst 
(black), the second (red), the third (blue) and the fourth (green) canonical vectors are indicated 
for each of the 23 chromosomes. 
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In this paper, we introduce a Sparse Alternating Regression (SAR) algorithm that considers 
the CCA problem from a predictive point of view. We recast the CCA problem into a penalized 
alternating regression framework to obtain sparse canonical vectors. Contrary to other popular 


sparse CCA procedures (i.e. Witten et ah, 2009 Parkhomenko et ah, 2009 Waaijenborg et al. 


2008) we do not impose any covariance restrictions. We show that the SAR algorithm produces 


much better results than the other sparse CCA approaches. Especially in simulation settings 
when there is a dependency structure within each set of variables, the gains in estimation accuracy 
achieved by the SAR algorithm are outspoken. Also in the genomic data application, the data sets 
contain highly correlated variables. We illustrate that the SAR algorithm considerably outperforms 
the other sparse CCA techniques in an out-of-sample cross-validation exercise. 


Both the SAR algorithm and the method of Waaijenborg et ah| (2008) use an alternating 
regression framework. There are, however, two important differences between both approaches. 


leading towards significant differences in performance. First, Waaijenborg et al. (2008) perform 
univariate soft thresholding, which ignores the dependency structure within each set of variables. 
In contrast, we apply the lasso penalty to multiple linear regressions. The lasso only equals the 


soft thresholding estimator for a linear model with orthonormal design (see e.g. Donoho and 


Johnstone, 1994). Secondly, we express the higher order canonical vectors in terms of the original 


data sets, whereas Waaijenborg et al. (2008) express them in terms of the deflated data matrices. 

In this paper, a lasso penalty is used to induce sparsity. Future work might consider other 
choices of penalty functions (see Prabhakar and Fridleyf 2012). For instance, the adaptive lasso 
(Zou, 2006), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), or a 


lasso with positivity constraints (see Lykou and Whittaker, 2010). Note that Lykou and Whittaker 


( 2010| also treat CCA as a least squares problem. They focus on orthogonality properties of CCA 
and only construct the hrst two pairs of sparse canonical vectors. Their approach could be ex¬ 
tended to higher order canonical correlations, but this would increase the number of orthogonality 
constraints and the computing time substantially. 

The level of sparsity produced by all sparse CCA techniques hinges on the selection method 
used for the sparsity parameters. This might lead to substantial differences in sparsity recognition 
performance, as illustrated in the simulation study. Future work still needs to be done on the 
comparison of methods (BIC, cross-validation, measure of explained variability, among others) to 
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select the optimal tuning parameters. 
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